Abstract

We describe a project to capitalize on newly avail-
able levels of computational resources in order to
understand human cognition. We will build an in- 
tegrated physical system including vision, sound 
input and output, and dextrous manipulation, all 
controlled by a continuously operating large scale 
parallel MIMD computer. The resulting system 
will learn to “think” by building on its bodily 
experiences to accomplish progressively more ab-
stract tasks. Past experience suggests that in at-
tempting to build such an integrated system we
will have to fundamentally change the way artifi-
cial intelligence, cognitive science, linguistics, and
philosophy think about the organization of intel-
ligence. We expect to be able to better reconcile
the theories that will be developed with current 
work in neuroscience. 

Project Overview 

We propose to build an integrated physical humanoid 
robot including active vision, sound input and out- 
put, dextrous manipulation, and the beginnings of lan- 
guage, all controlled by a continuously operating large 
scale parallel MIMD computer. This project will cap- 
italize on newly available levels of computational re- 
sources in order to meet two goals: an engineering goal 
of building a prototype general purpose flexible and 
dextrous autonomous robot and a scientific goal of un- 
derstanding human cognition. While there have been 
previous attempts at building kinematically humanoid 
robots, none have attempted the embodied construc- 
tion of an autonomous intelligent robot; the requisite 
computational power simply has not previously been 
available. 

The robot will be coupled into the physical world 
with high bandwidth sensing and fast servo-controlled
actuators, allowing it to interact with the world on a 
human time scale. A shared time scale will open up 
new possibilities for how humans use robots as assis- 
tants, as well as allowing us to design the robot to 
learn new behaviors under human feedback such as
human manual guidance and vocal approval. One of
our engineering goals is to determine the architectural 
requirements sufficient for an enterprise of this type. 
Based on our earlier work on mobile robots, our ex- 
pectation is that the constraints may be different than 
those that are often assumed for large scale parallel 
computers. If confirmed, such a conclusion could have
important impacts on the design of future sub-families 
of large machines. 

Recent trends in artificial intelligence, cognitive sci- 
ence, neuroscience, psychology, linguistics, and sociol- 
ogy are converging on an anti-objectivist, body-based 
approach to abstract cognition. Where traditional ap- 
proaches in these fields advocate an objectively speci- 
fiable reality — brain-in-a-box, independent of bodily 
constraints — these newer approaches insist that intel- 
ligence cannot be separated from the subjective expe- 
rience of a body. The humanoid robot provides the 
necessary substrate for a serious exploration of the 
subjectivist — body-based — hypotheses. 

There are numerous specific cognitive hypotheses 
that could be implemented in one or more of the hu- 
manoids that will be built during the five-year project. 
For example, we can vary the extent to which the robot 
is programmed with an attentional preference for some
images or sounds, and the extent to which the robot 
is programmed to learn to selectively attend to envi- 
ronmental input as a by-product of goal attainment 
(e.g., successful manipulation of objects) or reward by 
humans. We can compare the behavioral result of con- 
structing a humanoid around different hypotheses of 
cortical representation, such as coincidence detection 
versus interpolating memory versus sequence seeking 
in counter streams versus time-locked multi-regional 
retroactivation. In the later years of the project we 
can connect with theories of consciousness by demon- 
strating that humanoids designed to continuously act 
on immediate sensory data (as suggested by Dennett’s 
multiple drafts model) show more human-like behavior 
than robots designed to construct an elaborate world 
model. 

The act of building and programming behavior- 
based robots will force us to face not only issues of 
interfaces between traditionally assumed modularities,
but even the idea of modularity itself. By reaching 
across traditional boundaries and tying together many 
sensing and acting modalities, we will quickly illumi- 
nate shortcomings in the standard models, shedding 
light on formerly unrealized sociologically shared, but 
incorrect, assumptions. 

Background: the power of enabling 
technology 

An enabling technology — such as the brain that we will 
build — has the ability to revolutionize science. A re- 
cent example of the far-reaching effects of such techno- 
logical advances is the field of mobile robotics. Just as 
the advent of cheap and accessible mobile robotics dra- 
matically altered our conceptions of intelligence in the 
last decade, we believe that current high-performance 
computing technology makes the present an opportune 
time for the construction of a similarly significant in- 
tegrated intelligent system. 

Over the last eight years there has been a renewed 
interest in building experimental mobile robot systems 
that operate in unadorned and unmodified natural and 
unstructured environments. The enabling technology 
for this was the single chip micro-computer. This made
it possible for relatively small groups to build service- 
able robots largely with graduate student power, rather 
than the legion of engineers that had characterized ear- 
lier efforts along these lines in the late sixties. The 
accessibility of this technology inspired academic re- 
searchers to take seriously the idea of building systems 
that would work in the real world. 

The act of building and programming behavior- 
based robots fundamentally changed our understand- 
ing of what is difficult and what is easy. The effects 
of this work on traditional artificial intelligence can 
be seen in innumerable areas. Planning research has 
undergone a major shift from static planning to
“reactive planning.” The emphasis in computer
vision has moved from recovery from single images or 
canned sequences of images to active — or animate — 
vision, where the observer is a participant in the world 
controlling the imaging process in order to simplify the 
processing requirements. Generally, the focus within 
AI has shifted from centralized systems to distributed
systems. Further, the work on behavior-based mobile 
robots has also had a substantial effect on many other 
fields (e.g., on the design of planetary science missions, 
on silicon micro-machining, on artificial life, and on 
cognitive science). There has also been considerable 
interest from neuroscience circles, and we are just now 
starting to see some bi-directional feedback there. 

The grand challenge that we wish to take up is to 
make the quantum leap from experimenting with mo- 
bile robot systems to an almost humanoid integrated 
head system with saccading foveated vision, facilities 
for sound processing and sound production, and a com-
pliant, dextrous manipulator. The enabling technology
is massively parallel computing; our brain will have
large numbers of processors dedicated to particular 
sub-functions, and interconnected by a fixed topology 
network. 

Scientific Questions 

Building an android, an autonomous robot with hu- 
manoid form, has been a recurring theme in science 
fiction from the inception of the genre with Franken- 
stein, through the moral dilemmas infesting positronic 
brains, the human but not really human C-3PO and the
ever present desire for real humanness as exemplified 
by Commander Data. Their bodies have ranged from 
that of a recycled actual human body through various 
degrees of mechanical sophistication to ones that are 
indistinguishable (in the stories) from real ones. And 
perhaps the most human of all the imagined robots, 
HAL-9000, did not even have a body. 

While various engineering enterprises have modeled 
their artifacts after humans to one degree or another 
(e.g., WABOT-II at Waseda University and the space 
station tele-robotic servicer of Martin-Marietta) no 
one has seriously tried to couple human-like cognitive
processes to these systems. There has been an im- 
plicit, and sometimes explicit, assumption, even from 
the days of Turing (see Turing (1970)*) that the ul- 
timate goal of artificial intelligence research was to 
build an android. There have been many studies relat- 
ing brain models to computers (Berkeley 1949), cyber- 
netics (Ashby 1956), and artificial intelligence (Arbib 
1964), and along the way there have always been semi- 
popular scientific books discussing the possibilities of 
actually building real ‘live’ androids (Caudill (1992) is
perhaps the most recent). 

This proposal concerns a plan to build a series of 
robots that are humanoid in form, humanoid in
function, and to some extent humanoid in computa- 
tional organization. While one cannot deny the ro- 
mance of such an enterprise we are realistic enough to 
know that we can but scratch the surface of just a few 
of the scientific and technological problems involved in 
building the ultimate humanoid given the time scale 
and scope of our proposal, and given the current state 
of our knowledge. 

The reason that we should try to do this at all is 
that for the first time there is plausibly enough com- 
putation available. High performance parallel compu- 
tation gives us a new tool that those before us have 
not had available and that our contemporaries have 
chosen not to use in such a grand attempt. Our previ- 
ous experience in attempting to emulate much simpler 
organisms than humans suggests that in attempting 
to build such systems we will have to fundamentally 
change the way artificial intelligence, cognitive science, 
psychology, and linguistics think about the organiza-
tion of intelligence. As a result, some new theories will
have to be developed. We expect to be better able to
reconcile the new theories with current work in neuro-
science. The primary benefits from this work will be
in the striving, rather than in the constructed artifact.

* Different sources cite 1947 and 1948 as the time of writ-
ing, but it was not published until long after his death.

Brains 

Our goal is to take advantage of the new availability of 
massively parallel computation in dedicated machines. 
We need parallelism because of the vast amounts of 
processing that must be done in order to make sense 
of a continuous and rich stream of perceptual data. 
We need parallelism to coordinate the many actuation 
systems that need to work in synchrony (e.g., the oc- 
ular system and the neck must move in a coordinated 
fashion at times to maintain image stability) and which
need to be servoed at high rates. We need parallelism 
in order to have a continuously operating system that 
can be upgraded without having to recompile, reload, 
and restart all of the software that runs the stable lower 
level aspects of the humanoid. And finally we need par- 
allelism for the cognitive aspects of the system as we 
are attempting to build a “brain” with more capability 
than can fit on any existing single processor. 

But in real-time embedded systems there is yet an-
other necessary reason for parallelism. It is the fact
that there are many things to be attended to, hap- 
pening in the world continuously, independent of the 
agent. From this comes the notion of an agent be- 
ing situated in the world. Not only must the agent 
devote attention to perhaps hundreds of different sen- 
sors many times per second, but it must also devote 
attention “down stream” in the processing chain in 
many different places at many times per second as the 
processed sensor data flows through the system. The 
actual amounts of computation needed to be done by 
each of these individual processes is in fact quite small, 
so small that originally we formalized them as aug- 
mented finite state machines (Brooks 1986), although 
more recently we have thought of them as real-time 
rules (Brooks 1990a). They are too small to have a 
complete processor devoted to them in any machine 
beyond a CM-2, and even there the processors would 
be mostly idle. A better approach is to simulate par- 
allelism in a single conventional processor with its own 
local memory. 

For instance, Ferrell (1993) built a software system 
to control a 19 actuator six legged robot using about 60 
of its sensors. She implemented it as more than 1500 
parallel processes running on a single Philips 68070.
(It communicated with 7 peripheral processors which 
handled sensor data collection and 100Hz motor ser- 
voing.) Most of these parallel processes ran at rates 
varying between 10 and 25 Hertz. Each time each pro- 
cess ran, it took at most a few dozen instructions before 
blocking, waiting either for the passage of time or for 
some other process to send it a message. Clearly, low 
cost context switching was important. 
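
The figures quoted above can be turned into a rough load estimate that shows why cheap context switching matters. This is only a back-of-the-envelope sketch: the process count and rates come from the text, but the value chosen for "a few dozen instructions" (36) is an assumption made purely for illustration.

```python
# Rough load estimate for the process counts and rates quoted in the text.
# INSTRUCTIONS_PER_RUN is an assumed stand-in for "a few dozen".
PROCESSES = 1500            # "more than 1500" parallel processes
RATE_HZ = (10, 25)          # each process ran at between 10 and 25 Hz
INSTRUCTIONS_PER_RUN = 36   # assumption: three dozen instructions per activation


def activations_per_second(processes, rate_hz):
    """Number of times per second that some process is resumed."""
    return processes * rate_hz


def instructions_per_second(processes, rate_hz, instructions_per_run):
    """Useful work per second, ignoring all scheduling overhead."""
    return activations_per_second(processes, rate_hz) * instructions_per_run


low = activations_per_second(PROCESSES, RATE_HZ[0])
high = activations_per_second(PROCESSES, RATE_HZ[1])
print(low, high)  # 15000 37500
print(instructions_per_second(PROCESSES, RATE_HZ[1], INSTRUCTIONS_PER_RUN))  # 1350000
```

With tens of thousands of activations per second and only a few dozen instructions of useful work in each, any context switch that costs more than a handful of instructions would dominate the processor's time.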


The underlying computational model used on that 
robot, and with many tens of other autonomous
mobile robots we have built, consisted of networks
of message-passing augmented finite state machines. 
Each of these AFSMs was a separate process. The 
messages were sent over predefined wires from a spe- 
cific transmitting to a specific receiving AFSM. The 
messages were simple numbers (typically 8 bits) whose 
meaning depended on the designs of both the transmit- 
ter and the receiver. An AFSM had additional registers 
which held the most recent incoming message on any 
particular wire. This gives a very simple model of par- 
allelism, even simpler than that of CSP (Hoare 1985). 
The registers could have their values fed into a local 
combinatorial circuit to produce new values for regis- 
ters or to provide an output message. The network of 
AFSMs was totally asynchronous, but individual AF- 
SMs could have fixed duration monostables which pro- 
vided for dealing with the flow of time in the outside 
world. The behavioral competence of the system was 
improved by adding more behavior-specific network to 
the existing network. This process was called layering. 
This was a simplistic and crude analogy to evolution- 
ary development. As with evolution, at every stage of 
the development the systems were tested. Each of the 
layers was a behavior-producing piece of network in 
its own right, although it might implicitly rely on the 
presence of earlier pieces of network. For instance, an 
explore layer did not need to explicitly avoid obstacles, 
as the designer knew that a previous avoid layer would 
take care of it. A fixed priority arbitration scheme was 
used to handle conflicts. 
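
The AFSM model just described can be sketched in a few lines. This is a toy illustration in Python, not the original implementation: the class name `AFSM`, the sonar threshold, the motor command codes, and the choice to give the avoid layer priority over the explore layer are all assumptions, and monostable timers are omitted.

```python
class AFSM:
    """Toy augmented finite state machine (illustrative names throughout)."""

    def __init__(self, name, wires, step):
        self.name = name
        self.registers = {wire: 0 for wire in wires}  # one register per input wire
        self.step = step      # combinational function: registers -> message or None
        self.outputs = []     # (receiver, wire) pairs, fixed at design time

    def connect(self, receiver, wire):
        """Add a predefined wire from this machine to a specific receiver."""
        self.outputs.append((receiver, wire))

    def receive(self, wire, value):
        """Latch the most recent 8-bit message, overwriting the previous one."""
        self.registers[wire] = value & 0xFF

    def run(self):
        """Compute an output from the registers and send it down every wire."""
        message = self.step(self.registers)
        if message is not None:
            for receiver, wire in self.outputs:
                receiver.receive(wire, message)
        return message


STOP, FORWARD, TURN = 0, 1, 2   # hypothetical motor command codes


def avoid_step(registers):
    # Earlier layer: turn away when the (hypothetical) sonar reading is small.
    return TURN if registers["sonar"] < 20 else None


def explore_step(registers):
    # Later layer: always proposes going forward and never checks for obstacles.
    return FORWARD


def arbitrate(commands):
    """Fixed priority arbitration: the first non-None command wins."""
    for command in commands:
        if command is not None:
            return command
    return STOP


sonar = AFSM("sonar", ["raw"], lambda registers: registers["raw"])
avoid = AFSM("avoid", ["sonar"], avoid_step)
explore = AFSM("explore", ["sonar"], explore_step)
sonar.connect(avoid, "sonar")
sonar.connect(explore, "sonar")


def tick(reading):
    sonar.receive("raw", reading)
    sonar.run()
    return arbitrate([avoid.run(), explore.run()])


print(tick(100))  # 1
print(tick(5))    # 2
```

The explore layer contains no obstacle logic at all; it relies on the avoid layer being present, which mirrors the layering argument in the text.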

On top of the AFSM substrate we used another 
abstraction known as the Behavior Language, or BL 
(Brooks 1990a), which was much easier for the user 
to program with. The output of the BL compiler 
was a standard set of augmented finite state machines; 
by maintaining this compatibility all existing software 
could be retained. When programming in BL the user 
has complete access to full Common Lisp as a meta- 
language by way of a macro mechanism. Thus the user 
could easily develop abstractions on top of BL, while 
still writing programs which compiled down to net- 
works of AFSMs. In a sense, AFSMs played the role of 
assembly language in normal high level computer lan- 
guages. But the structure of the AFSM networks en- 
forced a programming style which naturally compiled 
into very efficient small processes. The structure of the 
Behavior Language enforced a modularity where data 
sharing was restricted to smallish sets of AFSMs, and 
whose only interfaces were essentially asynchronous 1- 
deep buffers. 

In the humanoid project we believe much of the com- 
putation, especially for the lower levels of the system, 
will naturally be of a similar nature. We expect to 
perform different experiments where in some cases the 
higher level computations are of the same nature and 
in other cases the higher levels will be much more sym-
bolic in nature, although the symbolic bindings will be
restricted to within individual processors. We need to 
use software and hardware environments which give 
support to these requirements without sacrificing the 
high levels of performance of which we wish to make 
use. 

Software 

For the software environment we have a number of re- 
quirements: 

• There should be a good software development envi-
ronment.

• The system should be completely portable over
many hardware environments, so that we can up-
grade to new parallel machines over the lifetime of
this project.

• The system should provide efficient code for percep-
tual processing such as vision.

• The system should let us write high level symbolic
programs when desired.

• The system language should be a standardized lan-
guage that is widely known and understood.

In summary our software environment should let us 
gain easy access to high performance parallel compu- 
tation. 

We have chosen to use Common Lisp (Steele Jr. 
1990) as the substrate for all software development. 
This gives us good programming environments includ- 
ing type checked debugging, rapid prototyping, sym- 
bolic computation, easy ways of writing embedded lan- 
guage abstractions, and automatic storage manage- 
ment. We believe that Common Lisp is superior to 
C (the other major contender) in all of these aspects. 

The problem then is how to use Lisp in a massively 
parallel machine where each node may not have the 
vast amounts of memory that we have become accus- 
tomed to feeding Common Lisp implementations on 
standard Unix boxes. 

We have a long history of building high performance 
Lisp compilers (Brooks, Gabriel & Steele Jr. 1982),
including one of the two most common commercial Lisp
compilers on the market, Lucid Lisp (Brooks, Posner,
McDonald, White, Benson & Gabriel 1986).

Recently we have developed L (Brooks 1993), a re- 
targetable small efficient Lisp which is a downwardly 
compatible subset of Common Lisp. When compiled 
for a 68000 based machine the load image (without 
the compiler) is only 140K bytes, but includes mul- 
tiple values, strings, characters, arrays, a simplified 
but compatible package system, all the “ordinary” as- 
pects of format, backquote and comma, setf etc., 
full Common Lisp lambda lists including optionals and 
keyword arguments, macros, an inspector, a debug-
ger, defstruct (integrated with the inspector), block,
catch, and throw, etc., full dynamic closures, a full
lexical interpreter, floating point, fast garbage collec-
tion, and so on. The compiler runs in time linear in
the size of an input expression, except in the presence 
of lexical closures. It nevertheless produces highly op- 
timized code in most cases. L is missing flet and 
labels, generic arithmetic, bignums, rationals, com- 
plex numbers, the library of sequence functions (which 
can be written within L) and esoteric parts of format 
and packages. 

The L system is an intellectual descendant of the
dynamically retargetable Lucid Lisp compiler (Brooks 
et al. 1986) and the dynamically retargetable Behav- 
ior Language compiler (Brooks 1990a). The system is 
totally written in L with machine dependent backends 
for retargeting. The first backend is for the Motorola
68020 (and upwards) family, but it is easily retargeted 
to new architectures. The process consists of writing a 
simple machine description, providing code templates 
for about 100 primitive procedures (e.g., fixed preci- 
sion integer +, *, =, etc., string indexing CHAR and 
other accessors, CAR, CDR, etc.), code macro expansion 
for about 20 pseudo instructions (e.g., procedure call,
procedure exit, checking correct number of arguments, 
linking CATCH frames, etc.) and two corresponding sets 
of assembler routines which are too big to be expanded 
as code templates every time, but are so critical in 
speed that they need to be written in machine lan- 
guage, without the overhead of a procedure call, rather 
than in Lisp (e.g., CONS, spreading of multiple values 
on the stack, etc.). There is a version of the I/O system 
which operates by calling C routines (e.g., fgetchar, 
etc.; this is how the Macintosh version of L runs) so 
it is rather simple to port the system to any hardware 
platform we might choose to use in the future. 

Note carefully the intention here; L is to be the de- 
livery vehicle running on the brain hardware of the 
humanoid, potentially on hundreds or thousands of 
small processors. Since it is fully downward compat- 
ible with Common Lisp however, we can carry out 
code development and debugging on standard work-
stations with full programming environments (e.g., in
Macintosh Common Lisp, or Lucid Common Lisp with 
Emacs 19 on a Unix box, or in the Harlequin program- 
ming environment on a Unix box). We can then dy- 
namically link code into the running system on our 
parallel processors. 

There are two remaining problems: (1) how to main- 
tain super critical real-time performance when using a 
Lisp system without hard ephemeral garbage collec- 
tion, and (2) how to get the level of within-processor 
parallelism described earlier. 

The structure of L’s implementation is such that 
multiple independent heaps can be maintained within 
a single address space, sharing all the code and data 
segments of the Lisp proper. In this way super-critical 
portions of a system can be placed in a heap where no 
consing is occurring, and hence there is no possibility 
that they will be blocked by garbage collection. 

The Behavior Language (Brooks 1990a) is an exam-
ple of a compiler which builds special purpose static
schedulers for low overhead parallelism. Each process 
ran until blocked and the syntax of the language forced 
there to always be a blocking condition, so there was 
no need for pre-emptive scheduling. Additionally the 
syntax and semantics of the language guaranteed that 
there would be zero stack context needed to be saved 
when a blocking condition was reached. We will need 
to build a new scheduling system with L to address 
similar issues in this project. To fit in with the phi- 
losophy of the rest of the system it must be a dy- 
namic scheduler so that new processes can be added 
and deleted as a user types to the Lisp listener of a 
particular processor. Reasonably straightforward data 
structures can keep these costs to manageable levels. 
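
A minimal sketch of such a run-until-blocked scheduler, with processes added and deleted while the system runs, might look as follows. It is written in Python rather than L, the class and method names are hypothetical, and a priority queue keyed on wake-up time stands in for the "reasonably straightforward data structures". Note one difference from the Behavior Language: Python generators do save stack context when they block.

```python
import heapq
import itertools


class Scheduler:
    """Cooperative scheduler: each process runs until it yields a delay (blocks)."""

    def __init__(self):
        self.now = 0.0
        self._queue = []                  # heap of (wake_time, sequence, name, process)
        self._processes = {}              # name -> generator currently registered
        self._sequence = itertools.count()

    def add(self, name, process, delay=0.0):
        """Add a process while the system is running."""
        self._processes[name] = process
        entry = (self.now + delay, next(self._sequence), name, process)
        heapq.heappush(self._queue, entry)

    def remove(self, name):
        """Delete a process; its stale queue entry is skipped when popped."""
        self._processes.pop(name, None)

    def run_until(self, end_time):
        while self._queue and self._queue[0][0] <= end_time:
            wake_time, _, name, process = heapq.heappop(self._queue)
            if self._processes.get(name) is not process:
                continue                  # process was removed or replaced
            self.now = wake_time
            try:
                delay = next(process)     # run until the process blocks
            except StopIteration:
                self._processes.pop(name, None)
                continue
            entry = (self.now + delay, next(self._sequence), name, process)
            heapq.heappush(self._queue, entry)
        self.now = end_time


log = []


def periodic(label, period):
    while True:
        log.append((label, scheduler.now))
        yield period                      # block until `period` seconds have passed


scheduler = Scheduler()
scheduler.add("fast", periodic("fast", 0.25))
scheduler.run_until(0.5)
scheduler.add("slow", periodic("slow", 0.5))   # added while the system is "running"
scheduler.remove("fast")                        # deleted without a restart
scheduler.run_until(1.0)
print(log)
```

Because every process must eventually reach a `yield`, no pre-emption is needed in this sketch; a process that fails to yield would stall the whole processor, which is the situation the safety scheduler discussed next is meant to catch.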

It is rather straightforward to build a phase into the L 
compiler which can recognize the situations described 
above. Thus it is straightforward to implement a set 
of macros which will provide a language abstraction 
on top of Lisp which will provide all the functionality 
of the Behavior Language and which will additionally 
let us have dynamic scheduling. Almost certainly a 
pre-emptive scheduler will be needed in addition, as 
it would be difficult to enforce a computation time 
limit syntactically when Common Lisp will essentially 
be available to the programmer — at the very least the 
case of the pre-emptive scheduler having to strike down 
a process will be useful as a safety device, and will also 
act as a debugging tool for the user to identify time 
critical computations which are stressing the bounded 
computation style of writing. In other cases static anal- 
ysis will be able to determine maximum stack require- 
ments for a particular process, and so heap allocated 
stacks will be usable.†

The software system so far described will be used 
to implement crude forms of ‘brain models’, where
computations will be organized in ways inspired by 
the sorts of anatomical divisions we see occurring in 
animal brains. Note that we are not saying we will 
build a model of a particular brain, but rather there 
will be a modularity inspired by such components as 
visual cortex, auditory cortex, etc., and within and 
across those components there will be further modu- 
larity, e.g., a particular subsystem to implement the 
vestibulo-ocular response (VOR). 

Thus besides on-processor parallelism we will need to 
provide a modularity tool that packages processes into 
groups and limits data sharing between them. Each 
package will reside on a single processor, but often pro- 
cessors will host many such packages. A package that 
communicates with another package should be insu- 
lated at the syntax level from knowing whether the 
other package is on the same or a different processor. 
The communication medium between such packages
will again be 1-deep buffers without queuing or receipt
acknowledgment; any such acknowledgment will need
to be implemented as a backward channel, much as we
see throughout the cortex (Churchland & Sejnowski
1992). This packaging system can be implemented in
Common Lisp as a macro package.

† The problem with heap allocated stacks in the general
case is that there will be no overflow protection into the
rest of the heap.
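
The 1-deep buffer discipline between packages can be illustrated with a short sketch. It is written in Python rather than as a Common Lisp macro package, and the class name `OneDeepBuffer` and its `fresh` flag are assumptions made for illustration; the point is only that a write overwrites any unread value and that acknowledgment, if wanted, travels over a second buffer in the opposite direction.

```python
class OneDeepBuffer:
    """Holds only the most recent value: no queue, no receipt acknowledgment."""

    def __init__(self):
        self._value = None
        self._fresh = False

    def write(self, value):
        self._value = value      # silently overwrites any unread value
        self._fresh = True

    def read(self):
        """Return (value, fresh); fresh is True only for a value not yet read."""
        fresh, self._fresh = self._fresh, False
        return self._value, fresh


forward = OneDeepBuffer()    # sender -> receiver
backward = OneDeepBuffer()   # receiver -> sender, an explicit backward channel

forward.write(1)
forward.write(2)             # overwrites 1 before it was ever read
value, fresh = forward.read()
backward.write(value)        # the receiver reports what it actually saw
print(value, fresh)          # 2 True
print(forward.read())        # (2, False)
print(backward.read())       # (2, True)
```

Since a package only ever calls `write` and `read`, the same interface could in principle be backed by local memory or by an inter-processor link, which is one way to keep the sender unaware of where the receiving package resides.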

We expect all such system level software develop- 
ment to be completed in the first twelve months of the 
project. 

Computational Hardware 

The computational model presented in the previous 
section is somewhat different from that usually as- 
sumed in high performance parallel computer appli- 
cations. Typically (Cypher, Ho, Konstantinidou & 
Messina 1993) there is a strong bias on system re- 
quirements from the sort of benchmarks that are used 
to evaluate performance. The standard benchmarks 
for modern high performance computation seem to 
be Fortran codes for hydrodynamics, molecular sim- 
ulations, or graphics rendering. We are proposing a 
very different application with very different require- 
ments; in particular we require real-time response to 
a wide variety of external and internal events, we re- 
quire good symbolic computation performance, we re- 
quire only integer rather than high performance float- 
ing point operations,* we require delivery of messages 
only to specific sites determined at program design 
time, rather than at run-time, and we require the abil- 
ity to do very fast context switches because of the large 
number of parallel processes that we intend to run on 
each individual processor. 

The fact that we will not need to support pointer ref- 
erences across the computational substrate will mean 
that we can rely on much simpler, and therefore 
higher performance, parallel computers than many 
other researchers — we will not have to worry about 
a consistent global memory, cache coherence, or ar- 
bitrary message routing. Since these are different re- 
quirements than those that are normally considered, 
we have to make some measurements with actual pro-
grams before we can make an intelligent off the
shelf choice of computer hardware.

In order to answer some of these questions we are
currently building a zeroth generation parallel com- 
puter. It is being built on a very low budget with off 
the shelf components wherever possible (a few fairly 
simple printed circuit boards need to be fabricated). 
The processors are 16MHz Motorola 68332s on a stan-
dard board built by Vesta Technology. These plug 16
to a backplane. The backplane provides each processor 
with six communications ports (using the integrated 
timing processor unit to generate the required signals
along with special chip select and standard address
and data lines) and a peripheral processor port. The
communications ports will be hand-wired with patch
cables, building a fixed topology network. (The ca-
bles incorporate a single dual ported RAM (8K by 16
bits) that itself includes hardware semaphores writable
and readable by the two processors being connected.)
Background processes running on the 68332 operating
system provide sustained rate transfers of 60Hz pack-
ets of 4K bytes on each port, with higher peak rates if
desired. These sustained rates do consume processing
cycles from the 68332. On non-vision processors we
expect much lower rates will be needed, and even on
vision processors we can probably reduce the packet
frequency to around 15Hz. Each processor has an op-
erating system, L, and the dynamic scheduler residing
in 1M of EPROM. There is 1M of RAM for program,
stack and heap space. Up to 256 processors can be
connected together.

* Consider the dynamic range possible in single signal
channels in the human brain and it soon becomes apparent
that all that we wish to do is certainly achievable with
neither a span of 600 orders of magnitude nor 4T significant
binary digits.
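
The port figures quoted in the hardware description above translate into the following sustained data rates. This is simple arithmetic rather than a measurement, and it assumes that "4K bytes" means 4096 bytes and that "8K by 16 bits" means 8192 sixteen-bit words.

```python
PACKET_BYTES = 4 * 1024      # assumption: "4K bytes" means 4096 bytes
PORTS = 6                    # communications ports per processor
RAM_BYTES = 8 * 1024 * 2     # dual ported RAM: 8K words of 16 bits each


def port_rate(packets_per_second, packet_bytes=PACKET_BYTES):
    """Sustained bytes per second on one port."""
    return packets_per_second * packet_bytes


print(port_rate(60))              # 245760
print(port_rate(60) * PORTS)      # 1474560
print(port_rate(15))              # 61440
print(RAM_BYTES // PACKET_BYTES)  # 4
```

Under these assumptions a processor driving all six ports at 60Hz moves roughly 1.5 million bytes per second, dropping the packet frequency to 15Hz cuts the per-port rate by a factor of four, and each dual ported RAM has room for four full packets.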

Up to 16 backplanes can be connected to a single 
front end processor (FEP) via a shared 500K baud se- 
rial line to a SCSI emulator. A large network of 68332s 
can span many FEPs if we choose to extend the con- 
struction of this zero-th prototype. Initially we will use 
a Macintosh as a FEP. Software written in Macintosh 
Common Lisp on the FEP will provide disk I/O ser-
vices to the 68332s, monitor status and health packets
from them, and provide the user with a Lisp listener 
to any processor they might choose. 

The zero-th version uses the standard Motorola SPI 
(serial peripheral interface) to communicate with up 
to 16 Motorola 6811 processors per 68332. Each of these is
a single chip processor with onboard EEPROM (2K
bytes) and RAM (256 bytes), including a timer system, 
an SPI interface, and 8 channels of analog to digital 
conversion. We are building a small custom board for 
this processor that includes opto-isolated motor drivers 
and some standard analog support for sensors.§

We expect our first backplane to be operational by 
August 1st, 1993 so that we can commence experi- 
ments with our first prototype body. We will collect 
statistics on inter-processor communication through- 
put, effects of latency, and other measures so that we 
can better choose a larger scale parallel processor for 
more serious versions of the humanoid.

In the meantime, however, there are certain devel- 
opments on the horizon within the MIT Artificial In- 
telligence Lab which we expect to capitalize upon in 
order to dramatically upgrade our computational sys- 
tems for early vision, and hence the resolution at which 
we can afford to process images in real time. The
first of these, expected in the fall, will be a some-
what similar distributed processing system based on
the much higher performance Texas Instruments C40,
which comes with built in support for fixed topology
message passing. We expect these systems to be avail-
able in the Fall ’93 timeframe. In October ’94 we ex-
pect to be able to make use of the Abacus system, a
bit level reconfigurable vision front-end processor being
built under ARPA sponsorship which promises Tera-op
performance on 16 bit fixed precision operands. Both
these systems will be simply integrable with our zero-th
order parallel processor via the standard dual-ported
RAM protocol that we are using.

§ We currently have 28 operational robots in our labs,
each with between 3 and 5 of these 6811 processors, and
several dozen other robots with at least 1 such processor
on board. We have great experience in writing compiler
backends for these processors (including BL) and great ex-
perience in using them for all sorts of servoing, sensor mon-
itoring, and communications tasks.

Bodies 

As with the computational hardware, we are also cur- 
rently engaged in building a zero-th generation body 
for early experimentation and design refinement to- 
wards more serious constructions within the scope of 
this proposal. We are presently limited by budgetary
constraints to building an immobile, armless, deaf
torso with only black and white vision.

In the following subsections we outline the constraints
and requirements on a full scale humanoid body and also
include, where relevant, details of our zero-th level
prototype.

Eyes 

There has been quite a lot of recent work on animate 
vision using saccading stereo cameras, most notably 
at Rochester (Ballard 1989), (Coombs 1992), but also
more recently at many other institutions, such as
Oxford University.

The humanoid needs a head with high mechanical 
performance eyeballs and foveated vision if it is to be 
able to participate in the world with people in a natu- 
ral way. Even our earliest heads will include two eyes, 
with foveated vision, able to pan and tilt as a unit, and 
with independent saccading ability (three saccades per 
second) and vergence control of the eyes. Fundamental 
vision based behaviors will include a visually calibrated
vestibulo-ocular reflex, smooth pursuit, visually
calibrated saccades, and object centered foveal relative
depth stereo. Independent visual systems will provide
peripheral and foveal motion cues, color discrimination,
human face pop-outs, and eventually face recognition.
Over the course of the project, object recognition
based on “representations” from body schemas and
manipulation interactions will be developed. This
is completely different from any conventional object
recognition scheme, and cannot be attempted without
an integrated vision and manipulation environment
such as the one we propose.

The eyeballs need to be able to saccade up to about 
three times per second, stabilizing for 250ms at each 
stop. Additionally the yaw axes should be controllable
for vergence to a common point and drivable in
a manner appropriate for smooth pursuit and for image
stabilization as part of a vestibulo-ocular response
(VOR) to head movement. The eyeballs do not need
to be force or torque controlled but they do need good
fast position and velocity control. We have previously
built a single eyeball, A-eye, on which we implemented
a model of VOR, opto-kinetic response (OKR) and
saccades, all of which used dynamic visually based
calibration (Viola 1990).
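
The idea of a visually calibrated VOR can be illustrated
with a small sketch. This is not the model of (Viola 1990);
it is a minimal example which assumes a single yaw axis, an
eye command of the form -gain * head velocity, and a
least-mean-squares update driven by the residual image
motion (retinal slip). The function name, the learning rate
and the test signal are all hypothetical.

```python
import math


def adapt_vor_gain(head_velocities, gain=0.5, learning_rate=0.5):
    """Adapt a scalar VOR gain from residual gaze motion (retinal slip).

    Assumed model (illustrative only):
      eye velocity = -gain * head velocity
      slip         = head velocity + eye velocity
    The update is a plain LMS step that drives the slip toward zero.
    """
    slips = []
    for head in head_velocities:
        eye = -gain * head          # compensatory eye command
        slip = head + eye           # gaze motion left uncorrected
        gain += learning_rate * slip * head
        slips.append(slip)
    return gain, slips


# Ten seconds of 0.5 Hz head oscillation sampled at 100 Hz (rad/s).
head_signal = [math.sin(2 * math.pi * 0.5 * t / 100) for t in range(1000)]
final_gain, slip_history = adapt_vor_gain(head_signal)
```

With an ideal eye plant the gain settles near 1, so the eye
counter-rotates at the head velocity. A real system would
also have to cope with sensor delay and plant dynamics,
which this sketch ignores.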

Other active vision systems have had both eyeballs 
mounted on a single tilt axis. We will begin experiments
with separate tilt axes but if we find that relative
tilt motion is not very useful we will back off from
this requirement in later versions of the head.

The cameras need to cover a wide field of view, 
preferably close to 180 degrees, while also giving a 
foveated central region. Ideally the images should be 
RGB (rather than the very poor color signal of stan- 
dard NTSC). A resolution of 512 by 512 at both the 
coarse and fine scale is desirable. 

The zero-th version of our cameras is black and
white only. Each eyeball consists of two small
lightweight cameras mounted with parallel axes. One 
gives a 115 degree field of view and the other gives a 
20 degree foveated region. In order to handle the im- 
ages in real time in our zero-th parallel processor we 
will subsample the images to be much smaller than the 
ideal. 
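
One simple way to subsample is block averaging, sketched
below. The text does not say which subsampling method is
used, so the method, the function name and the integer
grey level representation are assumptions made purely for
illustration.

```python
def subsample(image, factor):
    """Shrink a grey level image by averaging factor x factor blocks.

    `image` is a list of equal-length rows of integer pixel values.
    Both dimensions must be multiples of `factor` (an assumption that
    keeps the sketch short).
    """
    rows, cols = len(image), len(image[0])
    if rows % factor or cols % factor:
        raise ValueError("image size must be a multiple of factor")
    small = []
    for r in range(0, rows, factor):
        small_row = []
        for c in range(0, cols, factor):
            block = [image[r + dr][c + dc]
                     for dr in range(factor)
                     for dc in range(factor)]
            small_row.append(sum(block) // len(block))  # integer mean
        small.append(small_row)
    return small
```

Averaging rather than simply dropping pixels reduces
aliasing, at the cost of factor * factor reads per output
pixel.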

Later versions of the head will have full RGB color 
cameras, wider angles for the peripheral vision, much 
finer grain sampling of the images, and perhaps a
collinear optics setup using optical fiber cables and
beam splitters. With more sophisticated high speed
processing available we will also be able to do experiments
with log-polar image representations. 
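
As a rough illustration of a log-polar representation, the
sketch below samples an image on rings whose radii grow
geometrically from the image center, so that resolution is
highest at the center and falls off toward the periphery.
The nearest-pixel sampling, the parameter names and the
choice of ring spacing are assumptions, not a description
of any planned implementation.

```python
import math


def log_polar_sample(image, rings, wedges, r_min=1.0):
    """Resample `image` (list of rows) onto a rings x wedges log-polar grid.

    Ring i has radius r_min * (r_max / r_min) ** (i / (rings - 1)), so
    radii are spaced geometrically between r_min and the largest circle
    that fits in the image. Requires rings >= 2.
    """
    rows, cols = len(image), len(image[0])
    cy, cx = (rows - 1) / 2, (cols - 1) / 2
    r_max = min(cy, cx)
    grid = []
    for i in range(rings):
        radius = r_min * (r_max / r_min) ** (i / (rings - 1))
        ring = []
        for j in range(wedges):
            theta = 2 * math.pi * j / wedges
            y = int(round(cy + radius * math.sin(theta)))
            x = int(round(cx + radius * math.cos(theta)))
            ring.append(image[y][x])  # nearest-pixel lookup
        grid.append(ring)
    return grid
```

In this layout a rotation of the image about its center
becomes a shift along the wedge index, and a change of
scale becomes a shift along the ring index.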

Ears, Voice 

Almost no work has been done on sound understand- 
ing, as distinct from speech understanding. This 
project will start on sound understanding to provide 
a much more solid processing base for later work on 
speech input. Early behavior layers will spatially
correlate noises with visual events, and spatial
registration will be continuously self calibrating. Efforts will
concentrate on using this physical cross-correlation as 
a basis for reliably pulling out interesting events from 
background noise, and mimicking the cocktail party ef- 
fect of being able to focus attention on particular sound 
sources. Visual correlation with face pop-outs, etc., 
will then be used to be able to extract human sound 
streams. Work will proceed on using these sound
streams to mimic infants' abilities to ignore language
dependent irrelevances. By the time we get to elementary
speech we will therefore have a system able to
work in noisy environments and accustomed to multi- 
ple speakers with varying accents. 

Sound perception will use three high quality
microphones. (Although the human head uses only
two auditory inputs, it relies heavily on the shape of
the external ear in determining the vertical component
of the direction of a sound source.) Sound generation
will be accomplished using a speaker.
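
To make the idea of locating a sound source concrete, the
sketch below estimates a bearing from the arrival time
difference between one pair of microphones, found by
brute-force cross-correlation. It is only an illustration
under a far-field assumption; the microphone spacing, the
sample rate and the function names are hypothetical, and a
third microphone would be needed to resolve elevation.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at room temperature


def best_lag(left, right, max_lag):
    """Lag in samples at which `right` best matches `left`.

    A positive result means the right channel heard the sound later.
    """
    best, best_score = 0, float("-inf")
    n = len(left)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def bearing_degrees(lag_samples, sample_rate, mic_spacing):
    """Far-field bearing; positive angles point toward the left microphone."""
    delay = lag_samples / sample_rate
    ratio = delay * SPEED_OF_SOUND / mic_spacing
    ratio = max(-1.0, min(1.0, ratio))  # guard against noisy lags
    return math.degrees(math.asin(ratio))
```

For example, a lag of 3 samples at an assumed 44100 Hz
sample rate and 0.2 m spacing corresponds to a bearing of
a little under 7 degrees.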

Sound is critical for several aspects of the robot’s 
activity. First, sound provides immediate feedback for 
motor manipulation and positioning. Babies learn to 
find and use their hands by batting at and manipulat- 
ing toys that jingle and rattle. Adults use such cues 
as contact noises — the sound of an object hitting the 
table— to provide feedback to motor systems. Second, 
sound aids in socialization even before the emergence 
of language. Patterns such as turn-taking and mimicry 
are critical parts of children’s development, and adults 
use guttural gestures to express attitudes and other 
conversational cues. Certain signal tones indicate en- 
couragement or disapproval to all ages and stages of de- 
velopment. Finally, even pre-verbal children use sound 
effectively to convey intent; until our robots develop 
true language, other sounds will necessarily be a ma- 
jor source of communication. 

Torsos 

In order for the humanoid to be able to participate in 
the same sorts of body metaphors as are used by hu- 
mans, it needs to have a symmetric human-like torso. 
It needs to be able to experience imbalance, feel sym- 
metry, learn to coordinate head and body motion for 
stable vision, and be able to experience relief when it 
relaxes its body. Additionally the torso must be able 
to support the head, the arms, and any objects they 
grasp. 

The torsos we build will initially have a three degree 
of freedom hip, with the axes passing through a com- 
mon point, capable of leaning and twisting to any po- 
sition in about three seconds — somewhat slower than 
a human. The neck will also have three degrees of free- 
dom, with the axes passing through a common point 
which will also lie along the spinal axis of the body. 
The head will be capable of yawing at 90 degrees per 
second — less than peak human speed, but well within 
the range of natural human motions. As we build later 
versions we expect to increase these performance fig- 
ures to more closely match the abilities of a human. 

Apart from the normal sorts of kinematic sensors, 
the torso needs a number of additional sensors specifi- 
cally aimed at providing input fodder for the develop- 
ment of bodily metaphors. In particular, strain gauges 
on the spine can give the system a feel for its posture 
and the symmetry of a particular configuration, plus a 
little information about any additional load the torso 
might bear when an arm picks up something heavy. 
Heat sensors on the motors and the motor drivers will 
give feedback as to how much work has been done by 
the body recently, and current sensors on the motors 
will give an indication of how hard the system is work- 
ing instantaneously. 

Our zero-th level torso is roughly 18 inches from the
base of the spine to the base of the neck. This
corresponds to a smallish adult. It uses DC motors with
built in gearboxes. The main concern we have is how 
quiet it will be, as we do not want the sound perception 
system to be overwhelmed by body noise. 

Later versions of the torsos will have touch sensors 
integrated around the body, will have more compliant 
motion, will be quieter, and will need to provide better 
cabling ducts so that the cables can all feed out through 
a lower body outlet. 

Arms 

The eventual manipulator system will be a compliant 
multi-degree of freedom arm with a rather simple hand. 
(A better hand would be nice, but hand research is not 
yet at a point where we can get an interesting,
easy-to-use, off-the-shelf hand.) The arm will be safe enough
that humans can interact with it, handing it things 
and taking things from it. The arm will be compliant 
enough that the system will be able to explore its own 
body — for instance, by touching its head system — so 
that it will be able to develop its own body metaphors. 
The full design of even the first pair of arms is not
yet completely worked out, and current funding does 
not permit the inclusion of arms on the zero-th level 
humanoid. In this section, we describe our desiderata 
for the arms and hands. 

We want the arms to be very compliant yet still able 
to lift weights of a few pounds so that they can interact 
with human artifacts in interesting ways. Addition- 
ally we want the arms to have redundant degrees of 
freedom (rather than the six seen in a standard com- 
mercial robot arm), so that in many circumstances we 
can ‘burn’ some of those degrees of freedom in order 
to align a single joint so that the joint coordinates and 
task coordinates very nearly match. This will greatly 
simplify control of manipulation. It is the sort of thing 
people do all the time: for example, when bracing an 
elbow or the base of the palm (or even their middle 
and last two fingers) on a table to stabilize the hand 
during some delicate (or not so delicate) manipulation. 

The hands in the first instances will be quite simple; 
devices that can grasp from above relying heavily on 
mechanical compliance — they may have as few as one 
degree of control freedom. 

More sophisticated, however, will be the sensing on 
the arms and hands. We will use forms of conduc- 
tive rubber to get a sense of touch over the surface of 
the arm, so that it can detect (compliant) collisions it 
might participate in. As with the torso there will be 
liberal use of strain gauges, heat sensors and current 
sensors so that the system can have a ‘feel’ for how its
arms are being used and how they are performing. 

We also expect to move towards a more sophisticated 
type of hand in later years of this project. Initially, 
unfortunately, we will be forced to use motions of the 
upper joints of the arm for fine manipulation tasks. 
More sophisticated hands will allow us to use finger
motions, with much lower inertias, to carry out these
tasks.

Development Plan 

We plan on modeling the brain at a level above the neu- 
ral level, but below what would normally be thought 
of as the cognitive level. 

We understand abstraction well enough to know how 
to engineer a system that has similar properties and 
connections to the human brain without having to 
model its detailed local wiring. At the same time it 
is clear from the literature that there is no agreement 
on how things are really organized computationally at 
higher or modular levels, or indeed whether it even 
makes sense to talk about modules of the brain (e.g., 
short term memory, and long term memory) as gener- 
ative structures. 

Nevertheless, we expect to be guided, or one might 
say inspired, by what is known about the high level 
connectivity within the human brain (although ad- 
mittedly much of our knowledge actually comes from 
macaques and other primates and is only extrapolated 
to be true of humans, a problem of concern to some 
brain scientists (Crick & Jones 1993)). Thus for in- 
stance we expect to have identifiable clusters of pro- 
cessors which we will be able to point to and say they 
are performing a role similar to that of the cerebel- 
lum (e.g., refining gross motor commands into coordi- 
nated smooth motions), or the cortex (e.g., some as- 
pects of searching generalization/specialization hierar- 
chies in object recognition (Ullman 1991)). 

At another level we will directly model human sys- 
tems where they are known in some detail. For in- 
stance there is quite a lot known about the control 
of eye movements in humans (again mostly extrapo- 
lated from work with monkeys) and we will build in a 
vestibulo-ocular response (VOR), OKR, smooth pur- 
suit, and saccades using the best evidence available on 
how this is organized in humans (Lisberger 1988). 

A third level of modeling or inspiration that we will 
use is at the developmental level. For instance once we 
have some sound understanding developed, we will use 
models of what happens in child language development 
to explore ways of connecting physical actions in the 
world to a ground of language and the development 
of symbols (Bates 1979), (Bates, Bretherton & Sny- 
der 1988), including indexical (Lempert & Kinsbourne 
1985) and turn-taking behavior, interpretation of tone 
and facial expressions and the early use of memorized 
phrases. 

Since we will have a number of faculty, post-doctoral 
fellows, and graduate students working on concurrent 
research projects, and since we will have a number 
of concurrently active humanoid robots, not all pieces 
that are developed will be intended to fit together
exactly. Some will be incompatible experiments in
alternate ways of building subsystems, or putting them
together. Some will be pushing on particular issues in
language, say, that may not be closely related to
other particular issues, e.g., saccades. Also, quite
clearly, at this stage we cannot have a development
plan fully worked out for five years, as many of the 
early results will change the way we think about the 
problems and what should be the next steps. 

In figure 1, we summarize our current plans for de- 
veloping software systems on board our series of hu- 
manoids. In many cases there will be earlier work 
off-board the robots, but to keep clutter down in the 
diagram we have omitted that work here. 